Cross-Validation

Hooman Sabarou & Mounika Chevva (Advisor: Dr. Seals)

Literature Review-Fall 2024

Introduction

  • Challenges in Cross-Validation: Traditional methods such as k-fold and leave-one-out cross-validation can suffer from overfitting, high variance, and inefficiency on large datasets.
  • Innovative Solutions: New techniques such as Cross-Validation Voting (CVV), Monte Carlo cross-validation (MCCV), and improved methods for selecting the right number of components (Wold, EK procedures) enhance model generalization, reduce overfitting, and improve error rate estimation.
  • Performance and Efficiency: Leveraging advanced methods like Parallel-CV and GPU-based parallelization, as well as alternative cross-validation approaches in multivariate models, significantly boosts efficiency and accuracy, especially for large or small datasets.

Literature Review

  • CNN models face challenges like overfitting and generalization issues in supermarket product classification.[1]
  • Ensemble learning strategies such as voting, boosting, and bagging are commonly used to improve performance but often rely on single classifiers or validation sets.
  • The Cross-Validation Voting (CVV) method proposed by Duque Domingo et al. improves generalization and reduces overfitting, outperforming traditional methods in grocery classification.[2],[3],[4]
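
The voting step behind CVV can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the cited paper: it assumes each of the k fold-trained models outputs a vector of class probabilities and that the vectors are averaged (soft voting). The function name `soft_vote` and the sample probabilities are hypothetical.

```python
def soft_vote(fold_probabilities):
    """Average class probabilities from several fold-trained models.

    fold_probabilities: list of lists, one probability vector per model.
    Returns (winning_class_index, averaged_probabilities).
    """
    n_models = len(fold_probabilities)
    n_classes = len(fold_probabilities[0])
    averaged = [
        sum(probs[c] for probs in fold_probabilities) / n_models
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: averaged[c])
    return winner, averaged


# Hypothetical outputs of three models, each trained on a different fold split.
votes = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
winner, averaged = soft_vote(votes)
print(winner, [round(p, 3) for p in averaged])
```

Averaging lets a confident model outweigh two lukewarm ones; a hard (majority) vote over the per-model argmax would be an equally plausible reading of "voting".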

Literature Review

  • Overview of Cross-Validation Methods: The paper covers data splitting techniques, including single hold-out, k-fold, and leave-one-out cross-validation. [5]
  • Pros and Cons: Each method’s advantages and limitations are discussed in relation to dataset size and complexity.
  • Key Concepts: The paper addresses overfitting, stratification, and how to select the final model based on cross-validation results. [6], [7]
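
Stratification, mentioned in the last bullet, can be illustrated with a short sketch. It assumes a simple round-robin assignment within each class; the function name `stratified_folds` and the toy labels are hypothetical, and a real analysis would normally shuffle within each class first.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, keeping class
    proportions roughly equal across folds (round-robin within each class)."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy labels: class "a" is twice as common as class "b".
labels = ["a"] * 6 + ["b"] * 3
folds = stratified_folds(labels, 3)
for fold in folds:
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

Each fold here keeps the 2:1 class ratio of the full sample, which is the property stratification is meant to preserve.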

Literature Review

  • Focus on Model Generalizability: The paper highlights the issue of overfitting in statistical models and their poor performance on new data. [8]
  • Cross-Validation Techniques: It introduces k-fold and Monte Carlo cross-validation as methods to assess and improve generalizability, with practical implementation in R using the caret package.
  • Interactive Example: A hands-on Shiny app example illustrates how factors like model complexity, sample size, and effect size impact generalizability.
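
The difference between k-fold and Monte Carlo cross-validation lies in how the splits are drawn. The sketch below is illustrative only: it uses plain Python rather than the caret package discussed in the paper, and the function names `k_fold_splits` and `monte_carlo_splits` are hypothetical.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists; every index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def monte_carlo_splits(n, n_repeats, test_fraction, seed=0):
    """Yield (train, test) index lists from repeated random hold-out splits;
    an index may be tested several times or never."""
    rng = random.Random(seed)
    n_test = max(1, round(n * test_fraction))
    for _ in range(n_repeats):
        indices = list(range(n))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


kfold_tested = sorted(i for _, test in k_fold_splits(10, 5) for i in test)
mc_sizes = [len(test) for _, test in monte_carlo_splits(10, 4, 0.2)]
print(kfold_tested)
print(mc_sizes)
```

In k-fold the number of splits is tied to the fold size, whereas Monte Carlo cross-validation lets the number of repeats and the test fraction be chosen independently.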

Literature Review

  • Cross-Validation in PCA: The paper examines how cross-validation techniques help select the correct number of components in principal component analysis (PCA) to avoid overfitting and improve prediction accuracy.[9]
  • Alternative Methods: It introduces the improved Wold and Eastment–Krzanowski (EK) procedures to address overfitting and bias in component models.
  • Performance of Methods: The study finds that the Eigenvector and EM methods outperform others, particularly for smaller datasets, emphasizing the importance of choosing the right cross-validation method based on dataset size and complexity.

Introduction to the Dataset

Martensite Starting Temperature

  • Materials Science Dataset about Steel
  • Martensite Starting Temperature (Ms, in degrees Celsius) and chemical composition (weight percent of each element)
  • Ms changes with the chemistry of the steel
  • Ms matters because it controls the strength of the steel
  • The data contain 16 variables and 1543 observations

Application

Dataset

  • Data exploration
  • Key variables (Table 1)
Variable                     Min      Max      Mean     Median   SD
Ms (Martensite Start Temp)   310.00   784.00   601.80   605.00   120.00
C (Carbon)                   0.00     1.46     0.36     0.33     0.10
Mn (Manganese)               0.00     4.95     0.79     0.69     0.30
Ni (Nickel)                  0.00     27.20    1.56     0.15     0.50
Si (Silicon)                 0.00     3.80     0.35     0.26     0.20
Cr (Chromium)                0.00     16.20    1.04     0.52     0.70

Methodology

  • Modeling Approach:
    • Untransformed Model: Directly modeled Ms using predictors like C, Mn, Ni, Si, Cr, with interaction terms.

    • Log-Transformed Model: Modeled log(Ms) to handle non-normality and stabilize variance, using the same predictors and interaction terms.

    • Model Improvements (predictor removal, added interaction terms, outlier removal)

    • Model Diagnostics (ANOVA, AIC, cross-validation, multicollinearity check, removal of influential points)

    • Model Evaluation: The log-transformed model showed significantly better performance with a lower AIC and cross-validation MSE. Residual deviance and cross-validation confirmed that the log model generalized better to unseen data.

  • Cross-Validation Refinement:
    • K-Fold Cross-Validation with More Folds
    • Leave-One-Out Cross-Validation (LOOCV)
  • Programming was done in R [10] using RStudio (version 2024.04.2)
  • Packages used: tidyverse [11], classpackage [12], ggplot2 [13], psych [14], and boot [15], [16]

Models

  • First Model:

Ms = 769.41 - 286.71 C - 16.42 Mn - 14.04 Ni - 13.89 Si - 10.13 Cr - 41.45 C:Mn - 8.36 C:Ni

Variable   Mean ± SD     Regression Coefficient   P-value
C          0.36 ± 0.1    -286.71                  < 2e-16
Mn         0.79 ± 0.3    -16.42                   1.36E-13
Ni         1.55 ± 0.5    -14.04                   < 2e-16
Si         0.35 ± 0.2    -13.89                   1.70E-13
Cr         1.04 ± 0.7    -10.13                   < 2e-16
C:Mn       N/A           -41.45                   < 2e-16
C:Ni       N/A           -8.36                    9.68E-10
  • Summary of Model Metrics:
    • AIC: 13545

    • BIC: 1080321

    • R^2: 0.9016 (90.16%)

    • Adjusted R^2: 0.9010 (90.10%)
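
For reference, the four metrics listed here have the following standard textbook definitions. The symbols are assumptions introduced for this note: \hat{L} is the maximized likelihood, k the number of estimated parameters, n the number of observations, p the number of predictor terms, RSS the residual sum of squares, and TSS the total sum of squares.

```latex
\begin{aligned}
\mathrm{AIC} &= -2\ln\hat{L} + 2k, &
\mathrm{BIC} &= -2\ln\hat{L} + k\ln n, \\
R^2 &= 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, &
R^2_{\mathrm{adj}} &= 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}.
\end{aligned}
```

Lower AIC and BIC indicate a better trade-off between fit and complexity, while adjusted R^2 penalizes R^2 for the number of predictor terms.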

  • Second Model:

log(Ms) = 6.69 - 0.51 C - 0.032 Mn - 0.0255 Ni - 0.0226 Si - 0.0175 Cr - 0.0751 C:Mn - 0.0154 C:Ni

Variable   Mean ± SD     Regression Coefficient   P-value
C          0.36 ± 0.1    -0.51                    < 2e-16
Mn         0.79 ± 0.3    -0.032                   < 2e-16
Ni         1.55 ± 0.5    -0.0255                  < 2e-16
Si         0.35 ± 0.2    -0.0226                  4.48E-13
Cr         1.04 ± 0.7    -0.0175                  < 2e-16
C:Mn       N/A           -0.0751                  < 2e-16
C:Ni       N/A           -0.0154                  1.01E-11
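
To show how the two fitted equations are used, the sketch below evaluates both for one composition. The coefficients are copied from the equation and tables above; everything else is an assumption made for illustration. In particular, the intercept of the log model is taken as +6.69, on the assumption that a negative sign would be a typo, because log(Ms) for Ms between 310 and 784 lies between about 5.7 and 6.7. The composition uses the medians from Table 1, and the function names are hypothetical.

```python
import math

# Coefficients copied from the two fitted models in this section.
LINEAR = {"intercept": 769.41, "C": -286.71, "Mn": -16.42, "Ni": -14.04,
          "Si": -13.89, "Cr": -10.13, "C:Mn": -41.45, "C:Ni": -8.36}
# Intercept sign assumed positive (see the note above the code).
LOG = {"intercept": 6.69, "C": -0.51, "Mn": -0.032, "Ni": -0.0255,
       "Si": -0.0226, "Cr": -0.0175, "C:Mn": -0.0751, "C:Ni": -0.0154}


def linear_predictor(coefs, comp):
    """Evaluate intercept + main effects + C:Mn and C:Ni interactions."""
    value = coefs["intercept"]
    for element in ("C", "Mn", "Ni", "Si", "Cr"):
        value += coefs[element] * comp[element]
    value += coefs["C:Mn"] * comp["C"] * comp["Mn"]
    value += coefs["C:Ni"] * comp["C"] * comp["Ni"]
    return value


def predict_ms(comp):
    """Return (Ms from the first model, Ms from the second model after exp)."""
    return (linear_predictor(LINEAR, comp),
            math.exp(linear_predictor(LOG, comp)))


# Hypothetical steel composition in weight percent (Table 1 medians).
steel = {"C": 0.33, "Mn": 0.69, "Ni": 0.15, "Si": 0.26, "Cr": 0.52}
ms_linear, ms_log = predict_ms(steel)
print(round(ms_linear, 1), round(ms_log, 1))
```

The interaction terms C:Mn and C:Ni enter as products of the two weight percentages, which is how R expands the `:` operator for numeric predictors.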

ANOVA Check

  • First Model

  • Second Model

Model Comparison

Model                                   AIC       BIC          R^2
I (Basic)                               15699     2984481.84   0.753
II (Remove C=0)                         15010     2506169.56   0.788
III (Remove Co, Mo-C:Mn)                14984     2465894.55   0.791
IV (Remove V, C:Ni)                     14935     2384283.55   0.798
V (Log Model)                           -3578     72.79        0.808
VI (Influential Points Removal)         14751     2142102.54   0.816
VII (Influential Points Removal-Log)    -3733.4   71.99        0.826
VIII (Outliers Removal)                 13545     1080328.38   0.902
IX (Outlier Removal-Log)                -4756.5   68.34        0.914
  • First Model

  • Second Model

Cross-Validation

Two cross-validation methods were applied: k-fold and leave-one-out cross-validation (LOOCV).
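
Both procedures follow the same recipe and differ only in the number of folds. The sketch below is a minimal illustration, not the analysis code used for this project (which was written in R): it fits a one-predictor least-squares line in plain Python, and the function names and toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def cv_mse(xs, ys, k):
    """Mean squared prediction error over k folds; k == len(xs) gives LOOCV."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved, unshuffled
    squared_errors = []
    for test in folds:
        held_out = set(test)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        a, b = fit_line(train_x, train_y)
        squared_errors += [(ys[i] - (a + b * xs[i])) ** 2 for i in test]
    return sum(squared_errors) / n


# Toy data (hypothetical): carbon content vs. a made-up response.
carbon = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
response = [742, 711, 689, 650, 628, 601, 566, 540, 512, 480]
print(cv_mse(carbon, response, 5), cv_mse(carbon, response, 10))
```

Setting k equal to the sample size turns the same function into LOOCV, at the cost of one model fit per observation.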

k-Fold

Interpretation

  • Stability Across Folds:

The 5-fold and 10-fold cross-validation errors for the log-transformed model are nearly identical. This suggests that the log-transformed model is stable and performs consistently across different subsets of the data.

  • Comparison with Untransformed Model:

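The cross-validation errors for the log-transformed model (~0.0021) are numerically far lower than those of the untransformed model (~774 to ~780). The first is a squared error in log(Ms) and the second a squared error in Ms, so the two figures are on different scales and the gap is indicative only: confirming that the log-transformed model fits and generalizes better requires back-transforming its predictions to the Ms scale.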
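
One way to put the two models on the same footing is to back-transform the log-scale predictions before computing the error. The sketch below assumes a naive back-transform with exp() and no retransformation bias correction; the function name `mse_original_scale` and the sample values are hypothetical.

```python
import math


def mse_original_scale(observed_ms, predicted_log_ms):
    """MSE in Ms units from predictions made on the log(Ms) scale."""
    errors = [ms - math.exp(log_pred)
              for ms, log_pred in zip(observed_ms, predicted_log_ms)]
    return sum(e * e for e in errors) / len(errors)


# Hypothetical held-out observations and log-scale predictions.
observed = [600.0, 650.0, 550.0]
log_predictions = [math.log(610.0), math.log(640.0), math.log(565.0)]
mse = mse_original_scale(observed, log_predictions)
print(mse)

# Rough scale check using the figures quoted above: the square root of a
# log-scale MSE approximates a typical relative error, while the square
# root of an Ms-scale MSE is a typical error in Ms units.
print(round(math.sqrt(0.0021), 3), round(math.sqrt(774), 1))
```

An MSE computed this way is in the same units as the untransformed model's error, so the two can be compared directly.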

LOOCV:

  • Predictive Accuracy:

    The log-transformed model performs better in terms of LOOCV error, suggesting it is more reliable for prediction.

  • Practical Use:

    If the purpose of a model is interpretability or making predictions on the original scale of Ms, the untransformed model may still be relevant despite the higher LOOCV error. However, for optimal prediction accuracy, the log-transformed model is superior based on these results.
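
For models fitted by ordinary least squares, the LOOCV error does not require n separate refits. A standard identity, stated here as background rather than as the method used in this project, expresses it through the residuals of the single full-data fit, where \hat{y}_i is the fitted value and h_{ii} the leverage (the i-th diagonal element of the hat matrix):

```latex
\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}
\left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^{2}
```

High-leverage observations have h_{ii} close to 1, so their residuals are inflated the most, which is why influential points weigh heavily on the LOOCV error.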

Conclusion

  • The log-transformed model is the preferred choice based on k-fold cross-validation results. It demonstrates both lower prediction error and stability across different folds, making it a robust and accurate model for predicting the Martensite start temperature. Therefore, the log-transformed model should be selected as the final model for this project, as it provides more reliable predictions and handles the underlying data structure more effectively.

  • The log-transformed model shows a more stable and lower prediction error with LOOCV, supporting its choice as the better model in terms of predictive performance.

References

[1]
J. D. Domingo, R. M. Aparicio, and L. M. G. Rodrigo, “Cross validation voting for improving CNN classification in grocery products,” IEEE Access, vol. 10, pp. 20913–20925, 2022, doi: 10.1109/ACCESS.2022.3152224.
[2]
L. A. Yates, Z. Aandahl, S. A. Richards, and B. W. Brook, “Cross validation for model selection: A review with examples from ecology,” Ecological Monographs, vol. 93, no. 1, p. e1557, 2023, doi: 10.1002/ecm.1557.
[3]
C. Qi, J. Diao, and L. Qiu, “On estimating model in feature selection with cross-validation,” IEEE Access, vol. 7, pp. 33454–33463, 2019, doi: 10.1109/ACCESS.2019.2892062.
[4]
L. Lingqiao, Y. Huihua, H. Qian, Z. Jianbin, and G. Tuo, “Design and realization of the parallel computing framework of cross-validation,” in 2012 international conference on industrial control and electronics engineering, 2012, pp. 1957–1960. doi: 10.1109/ICICEE.2012.520.
[5]
D. Berrar, “Cross-validation,” in Encyclopedia of bioinformatics and computational biology, Elsevier, 2019, pp. 542–545.
[6]
M. Wentzien et al., “Machine learning‐based prediction of the martensite start temperature,” Steel Res. Int., Aug. 2024.
[7]
Q.-S. Xu and Y.-Z. Liang, “Monte Carlo cross validation,” Chemometrics and Intelligent Laboratory Systems, vol. 56, no. 1, pp. 1–11, 2001.
[8]
Q. C. Song, C. Tang, and S. Wee, “Making sense of model generalizability: A tutorial on cross-validation in R and Shiny,” Advances in Methods and Practices in Psychological Science, vol. 4, no. 1, p. 2515245920947067, 2021.
[9]
R. Bro, K. Kjeldahl, A. K. Smilde, and H. A. L. Kiers, “Cross-validation of component models: A critical look at current methods,” Analytical and Bioanalytical Chemistry, vol. 390, pp. 1241–1251, 2008.
[10]
R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2021. Available: https://www.R-project.org/
[11]
H. Wickham et al., “Welcome to the tidyverse,” Journal of Open Source Software, vol. 4, no. 43, p. 1686, 2019, doi: 10.21105/joss.01686.
[12]
I. Buker and S. Seals, classpackage: Functions for intro statistics courses at the University of West Florida. 2024. Available: https://github.com/ieb2/classpackage
[13]
H. Wickham, ggplot2: Elegant graphics for data analysis. Springer-Verlag New York, 2016. Available: https://ggplot2.tidyverse.org
[14]
W. Revelle, psych: Procedures for psychological, psychometric, and personality research. Evanston, Illinois: Northwestern University, 2024. Available: https://CRAN.R-project.org/package=psych
[15]
A. Canty and B. D. Ripley, boot: Bootstrap R (S-Plus) functions. 2024.
[16]
A. C. Davison and D. V. Hinkley, Bootstrap methods and their applications. Cambridge: Cambridge University Press, 1997. doi: 10.1017/CBO9780511802843.